Detecting robocalls using biometric voice fingerprints

ABSTRACT

The disclosed system and method detect robocalls using biometric voice fingerprints. The system receives audio input representing a plurality of telephone calls. For at least a portion of the telephone calls, the system analyzes the received audio based on a voice biometrics detection model to identify one or more biometric indicators characterizing a speaker in the analyzed telephone call. The system generates and stores a voice fingerprint characterizing the speaker based on the biometric indicators, and a time of the analyzed telephone call. The system analyzes stored voice fingerprints and times corresponding to speakers in the analyzed telephone calls to determine a frequency of occurrence of each voice fingerprint within an analyzed timeframe. If the frequency of occurrence of a voice fingerprint exceeds a threshold call quantity within the analyzed timeframe, the voice fingerprint is characterized as being associated with a robocaller.

CROSS-REFERENCE TO RELATED APPLICATIONS

This application is a continuation of U.S. patent application Ser. No. 17/086,284, filed Oct. 30, 2020, entitled “DETECTING ROBOCALLS USING BIOMETRIC VOICE FINGERPRINTS,” which claims the benefit of U.S. Provisional Patent Application No. 62/928,222, filed Oct. 30, 2019, entitled “SPEAKER VOICE BIOMETRIC IDENTIFICATION FOR SPAM BLOCKING,” which are both incorporated herein by reference in their entireties.

BACKGROUND

Robocalls and other spam calls are a widespread issue in the telecommunications space. These calls are often generated by humans or machines (e.g., by using Text-To-Speech (TTS) to convert text to recorded audio) and subsequently injected into a telecommunications system to mimic a human calling another party (e.g., an individual or business). Robocalls are typically prerecorded so that they can be played repeatedly and in a high volume of phone calls placed to many individuals or businesses. As robocalls have become more frequent, they are increasingly perceived as a nuisance because they (a) consume a large amount of time from individuals or businesses that receive and field the calls, (b) consume telephony network resources, and (c) increasingly are used for fraudulent purposes. Furthermore, certain robocalls are illegal when improperly used to solicit business or generate a profit. Accordingly, there is a need to detect and remove these calls from the telecommunications space.

BRIEF DESCRIPTION OF THE DRAWINGS

FIG. 1 is a block diagram illustrating an example environment in which a voice biometrics detection system operates.

FIG. 2 is a block diagram illustrating components of a voice biometrics detection system.

FIGS. 3A and 3B are flow diagrams illustrating a process for identifying a speaker in a phone call using voice fingerprinting.

FIG. 4 is a flow diagram illustrating a process for identifying a robocaller or spam caller in phone calls using voice fingerprinting.

DETAILED DESCRIPTION

A system and methods are disclosed for identifying robocallers and other spam or undesirable callers that place calls to consumers or businesses over telecommunications systems. The system utilizes an Artificial Intelligence (AI)-trained voice biometrics detection model to extract voice biometrics (e.g., biometric indicators) of a speaker within a phone call. Utilizing the voice biometrics, the system generates a voice fingerprint that characterizes the speaker. The generated voice fingerprint may be used for multiple purposes by the system. The system can compare a generated voice fingerprint to stored datasets of known callers and caller types (e.g., robocallers, spam callers, legitimate callers, etc.) to determine whether a particular call is legitimate or likely a robocaller or spam caller. The system can also use the generated voice fingerprint to monitor and detect a frequency of a particular caller on a telecommunications network. If the frequency of a detected caller exceeds certain thresholds, the system may categorize the caller as a likely robocaller. In some implementations, the disclosed system further takes corrective action based on identifying a robocaller or other spam caller. For example, when the system determines, based on the voice fingerprint, that the speaker is a robocaller, spam caller, or other undesirable caller, the system may terminate the call, display a warning, request a call recipient to confirm that the call is spam, or take other corrective action.

To facilitate the detection of robocalls, the system generates a dataset of voice biometrics that characterize a plurality of known callers, and further generates a dataset of voice fingerprints based on the voice biometrics. Call audio data that is analyzed by the system can contain verbal speech and/or non-verbal speech patterns uttered by humans or by machines configured to mimic or simulate the human voice. From the analyzed call audio data, the system extracts unique characteristics for each speaker that can be used to generate voice fingerprints (i.e., a profile, signature, or set of characteristics that identifies or characterizes the speaker). Characteristics that identify or characterize a human speaker include, for example, volume, pitch, speaking rate, pauses between each utterance, tonal properties, etc., that may be influenced, e.g., by the gender, age, ethnicity, language, and regional location of the speaker. The same characteristics also identify or characterize audio simulating human speech, e.g., as produced by a robocaller. Thus, the system uses characteristics of call audio data in a phone call to generate a voice fingerprint characterizing a speaker (whether human or machine simulation), which can be used to detect that speaker in other phone calls.

As used herein, “identify,” with respect to a speaker, means that the system may detect that the same speaker is likely present in two or more phone calls or other audio inputs, whether or not the specific identity of the speaker is known. In other words, the system may detect the presence of the same speaker in multiple audio sources by matching the voice fingerprint of the speaker. The system can determine matches between two or more voice fingerprints, for example, by calculating a similarity score between the fingerprints. A match is found when the compared speaker fingerprints are either exact matches or are sufficiently close that the probability that they represent the same speaker is very high (e.g., greater than 85%-90%). Thresholds for matching can be configurable or based on empirical data, such as training data. By matching voice fingerprints, the system can identify a speaker even though the spoken words or sentences may differ from speech used to generate a voice fingerprint because voice biometrics are largely consistent with respect to a speaker. In other words, the system can extract and use biometrics to generate voice fingerprints that identify the same speaker regardless of the content of the received speech or other audio information.

The system employs AI techniques, which may include artificial neural networks, to identify voice biometrics characterizing a speaker. The system receives live or recorded audio containing real or simulated human speech, and extracts voice biometrics from the received audio using AI models and data processing techniques. The extracted voice biometrics are expressed or represented in various data formats or structures, such as compressed and/or uncompressed data vectors or arrays. The AI data processing techniques include deep learning techniques that use training data (e.g., audio data) to process, extract, learn, and identify unique characteristics and biometrics of audio data associated with a speaker (collectively, “biometrics”). If the number of measured biometrics is sufficiently large, the combination of biometrics associated with an individual speaker will be sufficient to identify that speaker in a subsequent audio sample with enough accuracy that the likelihood of confusion with other speakers is very small. The degree of accuracy can be based on, for example, semi-supervised training of the system, configuration of the system (e.g., for a level of accuracy that is acceptable to a user), or empirically derived thresholds. Based on the training data, the AI data processing techniques generate voice biometrics detection models that, when applied to call audio data, identify and extract voice biometrics of speech in the analyzed call audio data. The extracted biometrics allow a speaker's speech to be compared with previously analyzed speech by comparing voice fingerprints generated based on extracted biometrics. In other words, the system uses AI data processing techniques and training data to generate models capable of identifying a speaker based on a biometric-based voice fingerprint.

The system generates a dataset of voice fingerprints associated with known speakers (i.e., known individuals each having a voice fingerprint) and classified into certain caller types (e.g., classified as spammers, robocallers, or known legitimate callers). To generate voice fingerprints of known speakers, the system captures or receives utterances or other audio of known speakers. The system uses an AI-generated biometrics detection model to extract voice biometrics associated with the known speakers from the captured or received audio. The system stores the extracted speaker biometrics in a known speaker biometric dataset. In other words, the system creates and stores voice fingerprints associated with known speakers in the audio based on extracted voice biometrics. The stored fingerprints can be associated with a caller type, such as spammers, robocallers, known legitimate callers, etc.

Depending on federal, state or local regulations, the voice fingerprints may be stored without personally identifiable information such that they are not correlated with identifiable individuals (if human). Alternatively, the voice fingerprints may be stored for a limited amount of time for use in detecting spammers and robocallers, after which the fingerprints may be deleted. By limiting either the information stored with biometric information or the length of storage, the system ensures compliance with any privacy laws or other rules governing storage of information characterizing telecommunications traffic.

The system may use a stored voice fingerprint to identify that audio with characteristics matching the stored fingerprint is present on a different telephone call. Detection of a stored voice fingerprint in another call (e.g., matching a voice fingerprint of a known speaker with a voice fingerprint of an unknown speaker) indicates that it is likely the same speaker that is speaking on the other call. The generated dataset of known speaker fingerprints may be used for detecting unwanted callers and, based on that detection, taking corrective steps such as “allowlisting” or “denylisting” phone numbers, requiring additional verification or authentication steps, handling the call differently, and so on as described in additional detail herein.

In an example implementation of the system, the system receives audio (e.g., a recorded or live phone call that has not previously been analyzed by the system), and uses the AI-generated models to extract voice biometrics from call audio data in the call and generate a voice fingerprint based on the extracted voice biometrics. The system searches for voice fingerprints in the known speaker dataset that match the generated voice fingerprint. For example, the system calculates a probability that the generated voice fingerprint matches one or more voice fingerprints stored in the known speaker dataset. Upon determining a match between the generated voice fingerprint from the call audio data and one or more voice fingerprints stored in the known speaker dataset, the system determines that the speaker in the call audio data and the identified speaker in the dataset of known speakers are the same. Because known speakers in datasets may also have been previously classified by the system with a caller type, the system can use that classification (e.g., a robocaller, a spam caller, a legitimate caller, and so on) to manage interactions with the caller on the received audio or to take further steps based on the classification. For example, the system may take various actions based on this determination, e.g., to request confirmation from a call recipient that the caller is of the known caller type, and so on.
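The lookup described above can be sketched as follows. This is a minimal illustration only, not the disclosed implementation. It assumes that a voice fingerprint is a fixed-length numeric vector and that cosine similarity stands in for the match probability. The names KnownSpeaker, cosine_similarity, and classify_call, and the 0.9 default threshold, are hypothetical.

```python
import math
from dataclasses import dataclass


@dataclass
class KnownSpeaker:
    """One entry in a hypothetical known speaker dataset."""
    speaker_id: str
    caller_type: str   # e.g., "robocaller", "spam", "legitimate"
    fingerprint: list  # assumed fixed-length vector of floats


def cosine_similarity(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def classify_call(fingerprint, known_speakers, threshold=0.9):
    """Return (caller_type, score) of the best match at or above threshold.

    Returns (None, None) when no stored fingerprint is similar enough.
    """
    best, best_score = None, threshold
    for speaker in known_speakers:
        score = cosine_similarity(fingerprint, speaker.fingerprint)
        if score >= best_score:
            best, best_score = speaker, score
    return (best.caller_type, best_score) if best else (None, None)
```

For example, with a dataset holding one robocaller fingerprint and one legitimate caller fingerprint, a new fingerprint close to the robocaller entry yields the caller type "robocaller", and a fingerprint far from both yields no match.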

The system determines a match between two or more voice fingerprints by calculating a similarity score indicating a degree of similarity or dissimilarity of the two or more voice fingerprints. To generate a similarity score, the system employs various types of similarity measures, such as Euclidean similarity measures, probabilistic linear discriminant analysis (PLDA), and so forth. Based on the similarity measures, the system generates a similarity score. If the similarity score exceeds a threshold score, then the system determines that there is a match (i.e., that a speaker in received audio is the same as a speaker corresponding to a stored voice fingerprint). The threshold can be configurable, such as by a user, whereby the user can specify a degree of certainty to determine a match. In this and other implementations, the threshold can be empirically derived.
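As one hedged illustration of a Euclidean similarity measure, the sketch below converts the Euclidean distance between two fingerprint vectors into a score in the range (0, 1] and compares it to a configurable threshold. The mapping 1 / (1 + distance), the function names, and the 0.85 default threshold are assumptions chosen for the example. The specification does not prescribe a particular formula.

```python
import math


def euclidean_similarity(a, b):
    """Map Euclidean distance to a score in (0, 1]; 1.0 means identical vectors."""
    if len(a) != len(b):
        raise ValueError("fingerprints must have the same length")
    distance = math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))
    return 1.0 / (1.0 + distance)


def is_match(a, b, threshold=0.85):
    """A match is declared only when the score exceeds the threshold."""
    return euclidean_similarity(a, b) > threshold
```

Raising the threshold demands a closer match, which corresponds to a user specifying a higher degree of certainty.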

In some implementations, the described system can maintain different treatments associated with different caller types, such as an “allowlist” of known legitimate callers and a “denylist” of known spam or robocallers. The system can be configured to, for example, automatically block or flag denylisted callers and automatically allow or pass allowlisted callers. These and other treatments can be maintained by the system, or generated by the system, e.g., based on the ability of the system to identify known speakers using voice biometric identification. An allowlist or denylist can track the identity of callers or speakers based on phone number, speaker voice fingerprints, or other identifiers associated with those callers or speakers.

A caller or speaker allowlist can, for example, include legitimate robocallers or other frequent or repeat callers for which no corrective action is taken. One example of a robocaller that the system may allow is an automated messaging system used to notify clients or patients of upcoming appointments, such as for dental or medical appointments. To classify such calls as legitimate, the system can add the speaker voice fingerprint associated with such calls to the caller allowlist. A caller or speaker allowlist can include phone numbers, voice fingerprints, and/or other identifying information to identify the speaker or caller. The system does not take corrective action upon confirming that a call or speaker in a call matches a speaker or caller included in an allowlist.

The system may also store a phone number, voice fingerprint, or other identifier associated with known callers in a denylist. For example, the system may determine a speaker in a phone call to be associated with a robocaller. Based on this determination, the system may take corrective action on calls that are associated with that voice fingerprint or other identifier. The system may automatically take corrective action on all phone calls from a phone number, or on all phone calls that match a voice fingerprint or contain another identifier present in a denylist. As described elsewhere herein, corrective action may include blocking or disconnecting the phone calls.
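The allowlist and denylist treatments can be sketched as a simple lookup. This is an illustration under stated assumptions: identifiers are plain strings, each voice fingerprint has already been resolved to a hypothetical fingerprint identifier, and the class name CallerLists and the treatment labels are invented for the example. The choice to check the denylist first is also an assumption, since the text does not say how a conflict between the two lists is resolved.

```python
from dataclasses import dataclass, field


@dataclass
class CallerLists:
    """Hypothetical allowlist and denylist keyed by phone number or fingerprint id."""
    allow: set = field(default_factory=set)
    deny: set = field(default_factory=set)

    def treatment(self, phone_number, fingerprint_id):
        """Return "block", "allow", or "analyze" for an incoming call."""
        identifiers = {phone_number, fingerprint_id}
        # Assumption: a denylist hit wins, so a spoofed allowlisted number
        # does not shield a denylisted voice fingerprint.
        if identifiers & self.deny:
            return "block"
        if identifiers & self.allow:
            return "allow"
        # Unknown callers fall through to further analysis.
        return "analyze"
```

Because either identifier can trigger a treatment, a denylisted voice fingerprint is still caught when the caller changes or spoofs its phone number.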

A phone number, voice fingerprint, or other identifier included in a stored allowlist or denylist can later be removed from such list. For example, the system can remove a phone number or fingerprint based on time (e.g., after a period of time has elapsed from when the phone number or fingerprint was added to the list). The system can also remove a phone number or fingerprint based on the frequency the phone number is used to place calls or that the voice fingerprint appears in calls, as measured during a particular timeframe. In other words, the system can reassess speakers or callers placed on an allowlist or denylist based on, e.g., the age of data used to originally place the speaker or caller on the list, lack of recent call data, changes in call frequency or other call behavior, or other factors. By continually or periodically reassessing whether speakers or callers have been appropriately classified as being on an allowlist or denylist, the system attempts to apply an appropriate treatment of speakers and callers over time. Timeframes for reassessing allowlists or denylists can be configurable or empirically derived. For example, the system can be configured to reassess lists every 30 days, 60 days, 90 days, etc., based on preferences or empirical information, e.g., showing a likely frequency of reassessment that will detect callers to be classified on each list to an acceptable degree of accuracy.
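One possible reassessment policy is sketched below. It is an assumption-laden example, not the disclosed method: an entry is dropped only when it is both older than a configurable maximum age and has shown no recent call activity. The names ListEntry and reassess, the recent_calls field, and the default values are hypothetical.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class ListEntry:
    """Hypothetical allowlist or denylist entry."""
    identifier: str      # phone number or fingerprint id
    added_at: datetime   # when the entry was placed on the list
    recent_calls: int    # calls seen in the most recent reassessment window


def reassess(entries, now, max_age=timedelta(days=30), min_recent_calls=1):
    """Keep entries that are young enough or still active; drop the rest."""
    kept = []
    for entry in entries:
        too_old = now - entry.added_at > max_age
        inactive = entry.recent_calls < min_recent_calls
        if not (too_old and inactive):
            kept.append(entry)
    return kept
```

Running this on a schedule (for example every 30, 60, or 90 days) approximates the periodic reassessment described above. A stricter policy could drop entries on either condition alone.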

Thus, the system and methods identify spam callers, robocallers, and other undesirable callers using voice biometrics, voice fingerprints, and AI data processing models to analyze real and simulated human speech and other call characteristics. Upon identifying the undesirable caller or callers, the system and methods can take corrective action such as by generating and sending a warning or other indication to a call recipient, requesting confirmation from a call recipient that a call is spam, disconnecting a call, or requesting for a call recipient to disconnect a call. The system can also automatically block or flag denylisted callers or automatically allow allowlisted callers.

Advantages of the system include improved ability to identify spam and robocallers using large datasets and AI data processing models. For example, the system and methods include automated processes for identifying spam and robocallers and taking appropriate corrective action to respond to the callers (e.g., by blocking or disconnecting a call), thus, saving efforts that a business may otherwise spend responding to spam and robocallers, reducing employee time spent responding to robocalls, conserving telephony network resources that would otherwise be used by robocallers, and reducing the risk of fraud perpetrated by spam callers and robocallers. In addition, the system increases accuracy and reliability of robocaller detection, e.g., by relying on a model trained using large datasets and checking for accuracy using confirmation requests sent to call recipients. Furthermore, the system includes methods for identifying new, unknown robocalls, e.g., by analyzing frequency of occurrence of voice fingerprints across telephone calls during one or more analyzed time periods (for example, to detect multiple, concurrent or near-concurrent calls including the same speaker). By detecting robocallers using the disclosed voice fingerprints, the system identifies robocallers even when a caller takes measures to conceal its identity, e.g., by “spoofing” or blocking caller identification (“caller ID”).
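The frequency-of-occurrence analysis can be illustrated with a sliding time window. The sketch below is a simplified example. It assumes that each analyzed call has already been reduced to a timestamp in seconds and a hypothetical fingerprint identifier (that is, matching fingerprints have been grouped under one id). The function name and parameters are invented for the illustration.

```python
from collections import defaultdict, deque


def flag_frequent_fingerprints(calls, window_seconds, threshold):
    """Return the fingerprint ids seen more than `threshold` times in any window.

    calls: iterable of (timestamp_seconds, fingerprint_id) pairs.
    """
    recent = defaultdict(deque)  # fingerprint id -> timestamps inside the window
    flagged = set()
    for timestamp, fingerprint_id in sorted(calls):
        window = recent[fingerprint_id]
        window.append(timestamp)
        # Discard timestamps that have fallen out of the analyzed timeframe.
        while timestamp - window[0] > window_seconds:
            window.popleft()
        if len(window) > threshold:
            flagged.add(fingerprint_id)
    return flagged
```

A short window with a low threshold targets concurrent or near-concurrent calls from the same speaker, while a longer window catches slower repeat calling. Because the key is the voice fingerprint and not the phone number, caller ID spoofing does not reset the count.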

One skilled in the art will appreciate that the system is not limited to the described application or applications herein. For example, some implementations of the system can automatically identify and differentiate between customers and agents (e.g., sales or customer service representatives, and so on) on the same telephone call. In other words, the system can be applied to separate a caller channel and an agent channel in a telephone call using voice biometrics. As an additional example, the system can identify or authenticate the identity of a caller to a call center, e.g., where the call center requires caller authentication to disclose confidential or sensitive information. In the example implementation, the system can augment or replace existing methods of caller identity verification or authentication (e.g., the system can serve as an alternative to answering security questions or providing other identifying information).

Various embodiments of the invention will now be described. The following description provides specific details for a thorough understanding and an enabling description of these embodiments. One skilled in the art will understand, however, that the invention may be practiced without many of these details. Additionally, some well-known structures or functions may not be shown or described in detail, so as to avoid unnecessarily obscuring the relevant description of the various embodiments. The terminology used in the description presented herein is intended to be interpreted in its broadest reasonable manner, even though it is being used in conjunction with a detailed description of certain specific embodiments of the invention.

Suitable Environments

FIG. 1 is a block diagram illustrating an environment 100 in which a voice biometrics detection system 115 operates. Although not required, aspects and implementations of the system may be described in the general context of computer-executable instructions, such as routines executed by a general-purpose computer, a personal computer, a server, or other computing system. The system can also be embodied in a special purpose computer or data processor that is specifically programmed, configured, or constructed to perform one or more of the computer-executable instructions explained in detail herein. Indeed, the terms “computer” and “computing device,” as used generally herein, refer to devices that have a processor and non-transitory memory, like any of the above devices, as well as any data processor or any device capable of communicating with a network. Data processors include programmable general-purpose or special-purpose microprocessors, programmable controllers, application-specific integrated circuits (ASICs), programmable logic devices (PLDs), or the like, or a combination of such devices. Computer-executable instructions may be stored in memory, such as random access memory (RAM), read-only memory (ROM), flash memory, or the like, or a combination of such components. Computer-executable instructions may also be stored in one or more storage devices, such as magnetic or optical-based disks, flash memory devices, or any other type of non-volatile storage medium or non-transitory medium for data. Computer-executable instructions may include one or more program modules, which include routines, programs, objects, components, data structures, and so on that perform particular tasks or implement particular abstract data types.

The system and methods can also be practiced in distributed computing environments, where tasks or modules are performed by remote processing devices, which are linked through a communications network, such as a Local Area Network (“LAN”), Wide Area Network (“WAN”) or the Internet. In a distributed computing environment, program modules or subroutines may be located in both local and remote memory storage devices. Aspects of the system described herein may be stored or distributed on tangible, non-transitory computer-readable media, including magnetic and optically readable and removable computer discs, stored in firmware in chips (e.g., EEPROM chips). Alternatively, aspects of the system may be distributed electronically over the Internet or over other networks (including wireless networks). Those skilled in the relevant art will recognize that portions of the system may reside on a server computer, while corresponding portions reside on a client computer.

In the environment 100, the voice biometrics detection system 115 is able to receive information associated with calls made by one or more callers 110 (shown individually as callers 110 a-110 n) via one or more networks 105. The voice biometrics detection system 115 is also able to receive information associated with one or more advertisers 112 (shown individually as advertisers 112 a-112 n) via the one or more networks 105. A caller 110 may be an individual person, whether operating in an individual capacity or as part of a business, a governmental agency, or any other entity capable of initiating telephone calls for any reason, including calls initiated in response to advertisements for products or services. A caller 110 may also be, for example, a robocaller or other computerized device for simulating human speech or transmitting recorded speech. An advertiser 112 similarly may be an individual person, a business, a governmental agency, or any other entity capable of receiving telephone calls in response to advertisements that are placed by the advertiser. The voice biometrics detection system 115 receives an indication when telephone calls are made from the callers 110 to the advertisers 112, either by directly monitoring to detect when a call is made, by receiving recorded audio from a call concurrently during the call or after the call has been completed, or by another process. The system may process such calls (i.e., “received calls”) to determine voice biometrics of speakers within a call, to assess probabilities of whether the call is spam (e.g., of whether the call is a robocall), and/or to take corrective action, if necessary, depending on the call assessment.

Networks 105 are any network suitable for communicatively coupling the callers 110 and the advertisers 112, such as a Voice over Internet Protocol (VoIP) network, a cellular telecommunications network, a public-switched telephone network (PSTN), any combination of these networks, or any other suitable network that can carry data and/or voice telecommunications. Networks 105 also allow information about calls between the callers 110 and advertisers 112, including the audio associated with such calls, to be conveyed to voice biometrics detection system 115.

The callers 110, advertisers 112, and voice biometrics detection system 115 may also communicate with each other and with publishers 125 via public or private networks 105, including for example, the Internet. The voice biometrics detection system 115 may provide an interface such as a website or an application programming interface (API) that allows system users to access the voice biometrics detection system 115, and which provides data regarding the voice biometrics detection services and functions. The publishers 125 provide content that includes phone numbers or other identifiers that allow callers to call advertisers. The advertisers may have dedicated phone numbers that are advertised to potential callers, or the advertisers may use transitory call tracking phone numbers provided from a call tracking system (not shown) to enable callers to call advertisers.

The callers 110 and advertisers 112 may have mobile devices and computers that are utilized for communicating with each other and with the publishers 125 through the network 105. Any mobile devices may communicate wirelessly with a base station or access point using a wireless mobile telephone standard, such as the Global System for Mobile Communications (GSM), Long Term Evolution (LTE), or another wireless standard, such as IEEE 802.11, and the base station or access point may communicate with publishers 125 via the network 105. Computers may communicate through the network 105 using, for example, TCP/IP protocols.

FIG. 2 is a block diagram illustrating various components of the voice biometrics detection system 115. The voice biometrics detection system 115 includes a storage area 230. The storage area 230 includes software modules and data that, when executed or operated on by a processor, perform certain of the methods or functions described herein. The storage area may include components, subcomponents, or other logical entities that assist with or enable the performance of some or all of these methods or functions. For example, the storage area includes an AI training module 270 that uses a training dataset of known telephone calls or other known audio to generate a voice biometrics detection model for extracting voice biometrics of a speaker. The extracted voice biometrics are used to generate voice fingerprints characterizing speakers and differentiating between speakers. Additionally, the storage area includes a call analysis module 275 that uses the voice biometrics detection model to analyze a received call to identify (e.g., generate, extract, etc.) voice biometrics and generate voice fingerprints that are associated with the received call. The call analysis module 275 additionally determines a probability (e.g., by calculating a similarity score) of whether an identified voice fingerprint matches previously-stored voice fingerprints, and/or determines a number of times a voice fingerprint appears in phone calls that occurred concurrently or within a given amount of time. The storage area also includes a corrective action module 280 to assess whether a determined probability of a match and/or the number of times a speaker voice fingerprint appears in phone calls exceeds one or more thresholds. If the thresholds are exceeded, the corrective action module 280 takes appropriate corrective action such as by terminating a call, warning the call recipient about the likelihood that the caller is a spam or robocaller, providing the call recipient the opportunity to terminate the call, and so on. The operation of training module 270, call analysis module 275, and corrective action module 280 will each be described in more detail with respect to FIGS. 3 and 4.

The voice biometrics detection system 115 stores data 255 a, 255 b . . . 255 n that characterizes one or more speakers. Data characterizing speakers can include raw audio data associated with each speaker, phone numbers or other unique identifiers for each speaker, voice biometrics extracted from audio data, voice fingerprints generated from the extracted voice biometrics, voice fingerprints of known speakers, and characterizations of caller type (e.g., legitimate callers, spam or robocallers, etc.) for each speaker. In some implementations, the voice biometrics detection system 115 can discard raw audio and identifying information of a caller after generating biometrics and fingerprints, and retain only biometrics and fingerprints for the caller for associating the caller with a determined caller type, for example, to avoid storage of private or confidential information. In some implementations, the voice biometrics detection system 115 can also discard biometrics and fingerprints for the caller, for example, when the system is configured to only detect live robocalls. In such implementations, the voice biometrics detection system 115 generates voice fingerprints to detect concurrent or near-concurrent instances of the same speaker in multiple phone calls, but the system may not store the generated voice fingerprints to detect the same caller in subsequent (i.e., non-concurrent) phone calls. Additionally, the voice biometrics detection system can store one or more received telephone calls that are to be analyzed for spam or robocaller activity. Additional information regarding the one or more sets of stored data 255 a, 255 b . . . 255 n characterizing the speakers is described in more detail with respect to FIGS. 3 and 4. A person of ordinary skill will appreciate that storage area 230 may be volatile memory, non-volatile memory, a persistent storage device (for example, an optical drive, a magnetic hard drive, a tape of a tape library, etc.), or any combination thereof.

The voice biometrics detection system 115 further includes one or more central processing units (CPU) 200 for executing software stored in the storage area 230, and a computer-readable media drive for reading information or installing software from tangible computer-readable storage media, such as a floppy disk, a CD-ROM, a DVD, a USB flash drive, and/or other tangible computer-readable storage media. The voice biometrics detection system 115 also includes one or more of the following: a network connection device 215 for connecting to a network, an information input device 220 (for example, a mouse, a keyboard, etc.), and an information output device 225 (for example, a display).

Voice Biometrics Detection by Comparison to Known Speakers

FIGS. 3A and 3B are flow diagrams illustrating processes 300 and 350 for identifying a speaker in a phone call using voice fingerprinting, configured in accordance with various embodiments of the system. The disclosed processes may be used to detect a known robocall based on a stored dataset of fingerprints associated with known robocalls or robocallers. In some embodiments, all or a subset of the one or more operations of the process 300 can be performed by components of a voice biometrics detection system.

Process 300 is executed by the system to generate a dataset of speaker fingerprints having an assigned caller type. At a block 305, the system receives audio of known speakers. The audio can be, for example, recorded phone calls or other audio data files. Each of the audio files associated with a speaker has an assigned caller type. For example, if the phone call or audio has been previously identified as spam (e.g., a known robocall), the speaker is classified as a spammer. If the phone call or audio has been previously identified as a legitimate conversation, the speaker is identified as a legitimate caller.

At a block 310, the system generates voice fingerprints by extracting (e.g., identifying, measuring, calculating, etc.) one or more voice biometrics for speakers in the audio received at block 305. Call audio data that is analyzed by the system can contain verbal and/or non-verbal speech uttered by humans or by machines configured to mimic or simulate the human voice. From the analyzed call audio data, the system extracts characteristics that identify or characterize the speaker including, for example, volume, pitch, speaking rate, pauses between each utterance, tonal properties, etc., that may be influenced, e.g., by the gender, age, ethnicity, language, and regional location of the speaker. In some embodiments, the system uses an AI-trained model to process and extract voice biometrics from the audio.

To train a model to extract voice biometrics from the audio, the system uses a dataset of known telephone calls as a training dataset. For example, the system can use a dataset of audio data (e.g., 200, 300, 400, 500 hours of audio data, etc.) in the training process, the dataset including a variety of speakers and speech content, as well as live or recorded audio and real or simulated human speech. Using traditional AI and neural net learning techniques, the system trains a voice biometrics detection model to identify distinguishing voice biometrics from audio. After being trained, the biometrics detection model can be used to identify biometrics that are used to characterize speakers from audio.

The biometrics detection model can be of any type, such as a universal background model (UBM), feed-forward (FF), long short-term memory (LSTM), or any other model capable of generating voice biometrics. In an example implementation, the system generates biometrics according to the following equation:

v=F(o)

In this equation, v represents voice biometrics generated by the system, o represents spectral or cepstral features extracted from audio data, and F represents the biometrics detection model. In other words, the AI-trained biometrics detection model is applied to features, such as spectral or cepstral features, in audio data to generate voice biometrics that characterize a speaker in the audio. The voice biometrics are used to define a voice fingerprint for each speaker.

The system utilizes extracted biometrics for each speaker to generate speaker fingerprints. Typically, the system is able to identify sufficient biometrics from several seconds (e.g., 3-5 or more seconds) of received audio to characterize speakers, although a greater or lesser amount of audio data may be required for identification. Once the biometrics are extracted, the system generates voice fingerprints for each speaker as compressed and/or uncompressed data vectors or arrays of one or more voice biometrics. The system represents voice fingerprints using high-dimension vectors, in which each dimension can be represented as a float or double-precision floating point number. Vectors and values associated with vectors are associated with various characteristics that can be used to identify individual speakers in received audio. These characteristics can include, for example, pitch, speaking rate, volume, pauses between utterances, etc. Because the specific characteristics are trained in the neural network, however, the correlation between each characteristic and vector value is hidden by the system. Notably, however, a comparison of similarities between speakers can be made by calculating a difference between two or more fingerprint vectors.

At a block 315, the system stores the voice biometrics and/or the voice fingerprints extracted at block 310. In some embodiments, the system stores the voice biometrics and/or the voice fingerprints in one or more known voice biometrics datasets. Entries in the dataset associate a set of voice biometrics and/or a voice fingerprint with a speaker and a caller type for that speaker, such as spam caller, robocaller, or legitimate caller. In some embodiments, the dataset also includes a treatment for a particular caller, such as adding them to an “allowlist” to indicate that calls associated with that caller should be allowed to connect with a call recipient, or adding the caller to a “denylist” to indicate that calls associated with the caller should be blocked. The following table provides an example format of a stored characterization for each known speaker:

Speaker ID   Voice Fingerprint   Caller Type         Date Added      Treatment
Speaker A    <vector A>          spammer             Mar. 16, 2020   denylist
Speaker B    <vector B>          legitimate caller   Mar. 17, 2020   allowlist
Speaker C    <vector C>          robocaller          Mar. 17, 2020   allowlist

It will be appreciated that the caller type does not always dictate the type of treatment for that caller. For example, although Speaker C is identified as a robocaller, the system has elected to treat Speaker C as an allowed caller because it is associated with a service that is considered to be a legitimate robocaller service (e.g., a dental service with reminder calls for appointments).
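As an illustration only, an entry of the known-speaker dataset described above might be sketched in Python as follows. The class name `KnownSpeaker`, the helper `should_block`, the field names, and the two-element fingerprint vectors are hypothetical and are not taken from the disclosure; real fingerprints would be high-dimension vectors.

```python
from dataclasses import dataclass
from datetime import date
from typing import List


@dataclass
class KnownSpeaker:
    """One row of the known-speaker dataset (field names are assumptions)."""
    speaker_id: str
    fingerprint: List[float]  # placeholder values; real vectors are high-dimension
    caller_type: str          # e.g., "spammer", "legitimate caller", "robocaller"
    date_added: date
    treatment: str            # "allowlist" or "denylist"


def should_block(entry: KnownSpeaker) -> bool:
    # The treatment, not the caller type, decides whether calls are blocked.
    return entry.treatment == "denylist"


speaker_a = KnownSpeaker("Speaker A", [0.1, 0.9], "spammer",
                         date(2020, 3, 16), "denylist")
speaker_c = KnownSpeaker("Speaker C", [0.7, 0.2], "robocaller",
                         date(2020, 3, 17), "allowlist")
```

In this sketch, Speaker C is a robocaller yet is not blocked, which mirrors the separation between caller type and treatment described above.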

The dataset generated by blocks 305-315 can be modified over time, as new known speakers are added to the dataset, speakers are removed from the dataset, or the treatment of a speaker changes over time.

Once the dataset of known callers has been generated, the system 115 can use the dataset to take corrective action with respect to newly-received calls. FIG. 3B is a flow chart of a process 350 implemented by the system to process new calls. At a block 355, the system receives one or more phone calls or audio files to monitor. The one or more phone calls or audio files can be concurrently received by the system while a call is happening, allowing the system to analyze the call during the pendency of the call itself. Alternatively, a recorded copy of the one or more phone calls or audio files can be received such that the system analyzes the phone calls or audio files after a call has ended. When initially received, the speakers in each of the one or more phone calls or audio files are unknown. The system analyzes the received phone calls for indications that a caller to an individual or a business is a robocaller, spam caller, or other undesirable caller.

At a block 360, the system generates voice fingerprints for each speaker in the audio received at block 355 by extracting one or more voice biometrics characterizing each speaker. Because users of the system are primarily concerned with the identity of the calling party (and not the identity of the recipient of the call), the system typically generates speaker voice fingerprints for the calling speaker and ignores the audio associated with the called party. In other cases, the system generates speaker voice fingerprints for both the calling party as well as the recipient. The system generates voice fingerprints in a manner similar to the process described herein at block 310, according to the voice biometrics detection model(s) generated by the system. The fingerprint may be generated during the pendency of a call (e.g., by generating a speaker voice fingerprint in seconds or minutes, while a caller is still on the line).

At a block 365, the system computes one or more probabilities, such as by calculating a similarity score, that a voice fingerprint generated for the unknown caller at block 360 matches a stored voice fingerprint of a known caller. The stored voice fingerprint may be associated with, for example, a known spam caller, a known robocaller, a known legitimate caller, etc. Additionally, the known caller may be on the allowlist or the denylist. In some embodiments, the system searches a dataset comprising voice fingerprints for known callers and/or voice biometrics of known speakers for potential matches to the voice fingerprint generated at block 360. The system can find closely matching fingerprints by calculating a distance between fingerprint vectors using any common mathematical technique such as cosine similarity, Euclidean distance, Mahalanobis distance, probabilistic linear discriminant analysis (PLDA), etc., and identifying vectors with the least distance. The system identifies a subset of voice fingerprints stored in the dataset that are potential matches and computes a probability of match for each speaker voice fingerprint in the subset. In other embodiments, the system computes a probability of match for every speaker voice fingerprint stored in the dataset.
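A minimal Python sketch of two of the vector comparisons named above (cosine similarity and Euclidean distance) follows. The function names, the dictionary layout of the known-caller dataset, and the two-element vectors are assumptions made for illustration. The sketch returns a raw similarity score; how such a score would be mapped to a probability is not specified here and is left out.

```python
import math
from typing import Dict, Optional, Sequence, Tuple


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two fingerprint vectors (1.0 = same direction)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # treat a zero vector as matching nothing
    return dot / (norm_a * norm_b)


def euclidean_distance(a: Sequence[float], b: Sequence[float]) -> float:
    """Straight-line distance between two fingerprint vectors (0.0 = identical)."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def best_match(unknown: Sequence[float],
               known: Dict[str, Sequence[float]]) -> Optional[Tuple[str, float]]:
    """Return (speaker_id, similarity) for the closest stored fingerprint."""
    if not known:
        return None
    scored = ((sid, cosine_similarity(unknown, fp)) for sid, fp in known.items())
    return max(scored, key=lambda pair: pair[1])


# Hypothetical two-dimensional fingerprints, for illustration only.
known_callers = {"Speaker A": [1.0, 0.0], "Speaker B": [0.0, 1.0]}
match = best_match([0.9, 0.1], known_callers)
```

A linear scan over every stored fingerprint corresponds to the embodiment that scores the whole dataset; the subset-first embodiment would narrow `known` before calling `best_match`.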

At a decision block 370, the system compares the one or more probabilities computed at block 365 to a threshold. The threshold can represent a confidence level at or above which the system identifies a match between a speaker voice fingerprint generated at block 360 and a stored speaker voice fingerprint associated with a known caller. Example thresholds include 75%, 80%, 90%, 95%, 98%, 99%, and 100%, among others. Thresholds can be configurable, based on semi-supervised training of the system, and/or empirically determined, such that the threshold differentiates between speakers to an acceptable degree of accuracy corresponding to the threshold. When the calculated probability meets or exceeds the threshold, the system treats the newly-received voice fingerprint as having matched the previously-identified known speaker voice fingerprint.

If the system determines that a probability computed at block 365 meets or exceeds the threshold at block 370, then the system concludes that the speaker corresponding to the voice fingerprint generated at block 360 matches the known speaker corresponding to the stored speaker voice fingerprint. In other words, the system determines that the speaker associated with the voice fingerprint generated at block 360 is likely the same as a particular known speaker represented in the matching voice fingerprint.

At decision block 375, the system determines whether a corrective action is needed based on the identified known speaker. If the identified known speaker is a legitimate caller on an allowlist, for example, no corrective action needs to be taken. In that case, processing terminates. In the event that the known speaker corresponding to the voice fingerprint stored at block 315 is a known spam caller or robocaller on a denylist, however, processing continues to block 380.

At block 380, the system takes an appropriate corrective action depending on the identity of the known speaker and the system settings. Corrective action can include (a) generating, transmitting, and/or displaying an audio or visual warning or notification to a party of the phone call that the call is spam, (b) automatically disconnecting the phone call, or (c) requesting authorization from the receiving party to disconnect the phone call. For example, the system can transmit an audio warning or display a visual warning on a screen to warn a call recipient that he or she is likely interacting with a spam caller, robocaller, etc. The system can also, for example, automatically disconnect the phone call upon determining or confirming that the call is a spam call, robocall, etc. In some embodiments, the system can transmit to the call recipient a message indicating that the call is likely from a spam caller, robocaller, etc. and requesting permission to disconnect the call. The message may be transmitted to the call recipient via a message in a graphical user interface (GUI), a message sent via a service or protocol (e.g., text message, Short Message Service (SMS), Rich Communication Service (RCS), etc.), and so forth. In response to the sent message, the system receives a message from the call recipient that either confirms that the call should be disconnected or indicates that the call should be allowed to proceed.

If the system determines that a probability computed at block 365 does not meet or exceed the threshold at block 370, then the system concludes that the speaker corresponding to the voice fingerprint is still indeterminate. That is, the system is unable to associate the voice fingerprint with a known caller. In that case, processing continues to block 385 where the system takes a monitoring action. A monitoring action can include sending a caller confirmation request to a call recipient following a call and requesting that the call recipient characterize the call with a caller type (e.g., legitimate caller, spam caller, robocaller, etc.). In some embodiments, corrective action includes generating and transmitting to a call recipient a caller confirmation request, such as a robocaller confirmation request. The caller confirmation request informs a call recipient that a speaker in a telephone call is likely associated with a known caller type (e.g., spam caller, legitimate caller, robocaller, etc.), and requests confirmation from the call recipient that the caller is of the known caller type. The system can transmit the caller confirmation request to the call recipient via a message in a graphical user interface (GUI), a message transmitted via a service or protocol (e.g., text message, Short Message Service (SMS), Rich Communication Service (RCS), etc.), an email, and so forth. In response to the caller confirmation request, the system receives a message from the call recipient that either confirms or denies that the caller is of the identified caller type. The call recipient may provide the return message to the system by selecting a control within the presented GUI of the original caller confirmation request, by sending a responsive text, SMS, or RCS communication, by sending a responsive email, and so forth. In response to the caller confirmation request, the system thereby receives an indication from the call recipient with an appropriate caller type.

In some embodiments, the monitoring action can be taken depending on the proximity of the probability computed at block 365 to the threshold at block 370. If the computed probability is close to, but below, the threshold, there is a greater likelihood that the corresponding caller may be a robocaller or spam caller. In that case, the monitoring action can be taken by the system to confirm the caller type with the call recipient. In contrast, if the probability computed at block 365 is very low, the likelihood that the corresponding caller is a robocaller or spam caller is remote. In that case, the system may take no monitoring action.
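The threshold comparison at block 370 and the proximity-based monitoring decision can be sketched as a single function. This is only one possible reading: the function name, the 0.90 match threshold, and the 0.15 "close to the threshold" margin are hypothetical values chosen for the example, and the disclosure does not define how close is close enough.

```python
def choose_next_step(probability: float,
                     match_threshold: float = 0.90,
                     monitor_margin: float = 0.15) -> str:
    """Map a match probability to the next processing step.

    "match"   -> probability meets or exceeds the threshold (continue to block 375)
    "monitor" -> below the threshold but within the margin (block 385)
    "none"    -> far below the threshold; no monitoring action is taken
    """
    if probability >= match_threshold:
        return "match"
    if probability >= match_threshold - monitor_margin:
        return "monitor"
    return "none"
```

For instance, with these assumed values a probability of 0.80 falls short of a match but is close enough to trigger a caller confirmation request, while a probability of 0.10 is ignored.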

In some embodiments, the monitoring action can include analyzing interactions associated with the caller voice fingerprint across multiple calls or multiple channels of a call. The system can analyze received audio input for various information and data such as a duration that a speaker talks in the audio input, data from two or more channels of a phone call (e.g., whether multiple speakers or callers on different channels of the call interact with one another, such as a customer and agent), and/or other characteristics of audio and voice signal pattern-based analysis. When audio of a phone call is recorded and/or transcribed, the information and data can be generated via natural language processing (NLP) and/or natural language understanding (NLU) and used to detect real conversation (e.g., conversation that includes both sides on the phone call engaged in meaningful discussion and/or about meaningful topics). For example, the system can determine that a caller is legitimate when the system detects that the call recipient interacts with the caller for an extended period, e.g., by having an interactive conversation, responding to questions or prompts, or otherwise responding to the call. In contrast, the system can identify a call as illegitimate, for example, if the call recipient does not interact with the caller (e.g., immediately disconnects the call without speaking or otherwise responding to the caller).

Based on the monitoring actions, the system can assign a caller type and a treatment for the caller type to the caller voice fingerprint. That is, the system can create a new known caller entry in the maintained dataset generated in block 315. Once a caller has been added to the known caller dataset, the system can treat future calls having a voice fingerprint matching the stored voice fingerprint in accordance with the corrective actions described herein.

Although the operations of the processes 300 and 350 are discussed and illustrated in a particular order, the processes are not so limited. In some embodiments, the processes may perform operations in a different order than described herein. Furthermore, a person skilled in the art will readily recognize that the processes can be altered and still remain within these and other embodiments of the system. For example, one or more operations illustrated in FIGS. 3A and 3B can be omitted from and/or repeated within the processes in some embodiments.

Additional or alternative operations not depicted in FIGS. 3A and 3B can be included in the example processes 300 and 350 in accordance with various embodiments of the system. For example, the system can take into account the age of the analyzed data in determining whether a caller voice fingerprint should be added to the “allowlist” or “denylist.” In particular, older analyzed data can be weighted less than newer analyzed data when assigning a characterization to a particular voice fingerprint. Additionally, the system can take into account the length of time that a particular voice fingerprint has been on the allowlist or denylist. On a periodic basis, calls associated with a voice fingerprint can be reassessed to ensure that the voice fingerprint continues to be associated with behaviors consistent with the applied characterization. In other words, the system can update a speaker or caller “denylist” or “allowlist” from time to time to remove voice fingerprints from either list.
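One way to weight older analyzed data less than newer data is an exponential decay, sketched below. The decay form, the 30-day half-life, and the function name are assumptions introduced for this example; the disclosure states only that older data can be weighted less.

```python
from typing import Iterable, Tuple


def weighted_spam_score(observations: Iterable[Tuple[float, bool]],
                        half_life_days: float = 30.0) -> float:
    """Age-weighted fraction of observations flagged as spam.

    Each observation is (age_in_days, flagged_as_spam). An observation's
    weight halves every `half_life_days`, so recent calls dominate the score.
    """
    weighted_spam = 0.0
    total_weight = 0.0
    for age_days, flagged_as_spam in observations:
        weight = 0.5 ** (age_days / half_life_days)
        total_weight += weight
        if flagged_as_spam:
            weighted_spam += weight
    return weighted_spam / total_weight if total_weight > 0.0 else 0.0


# A spam flag from 60 days ago counts for a quarter of a clean call from today.
score = weighted_spam_score([(0.0, False), (60.0, True)])
```

A periodic reassessment could recompute this score for each listed fingerprint and remove the fingerprint from the denylist or allowlist when the score no longer supports the applied characterization.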

Furthermore, the processes 300 and 350 can take into account additional or alternative factors in identifying unknown callers and/or taking corrective action without deviating from the teachings of the present disclosure. For example, an unknown speaker can be identified, in part, based on other identifying information such as a phone number or other identifier associated with a caller, speaker, or user.

Voice Biometrics Identification By Comparing Speakers Across Multiple Calls

FIG. 4 is a flow diagram illustrating a process 400 executed by the system for identifying a robocaller or spam caller in phone calls using voice fingerprinting. The disclosed process detects new or unknown robocallers based on frequency of detection of a common voice fingerprint over one or more analyzed time periods. In some embodiments, all or a subset of the one or more steps of the process 400 can be performed by components of the voice biometrics detection system.

At a block 405, the system receives a set of phone calls to analyze. The phone calls can be “live,” such that the process 400 monitors the audio signal of each phone call and analyzes the call while the call is happening. Alternatively or additionally, the phone calls can be received as recorded audio files such that the process 400 processes each phone call in a delayed fashion (e.g., with a time delay, but during the pendency of a call) or each call after the call has ended. The phone calls can be phone calls that occur concurrently or within a short time period (e.g., within a few seconds or minutes) of one another. In these and other embodiments, the phone calls can be phone calls that occur concurrently or within a longer time period (e.g., within several minutes, hours, days, weeks, etc.) of one another. The phone calls can be phone calls of known and/or unknown speakers.

At a block 410, the system generates voice fingerprints by extracting (e.g., identifying, measuring, calculating, etc.) one or more voice biometrics characterizing speakers in the received audio. Phone calls typically have two channels, one associated with the caller and the other associated with the called party. In some embodiments, the process 400 generates voice fingerprints of speaking parties on only one channel of the phone call (e.g., on only the caller side). The system typically focuses its analysis on the caller since the called party is typically a known individual. Voice fingerprints are generated using the voice biometrics detection model(s) generated by the system. The system expresses generated voice fingerprints as compressed and/or uncompressed data vectors or arrays of one or more voice biometrics, as described herein. After generating a voice fingerprint, the system stores the generated voice fingerprint in association with one or more time stamps reflecting a start time of the call, an end time of the call, or both the start and end time of the call. The voice fingerprint and corresponding time stamps are stored by the system in a dataset or database.

At a block 415, the system selects a short time period and corresponding set of received calls to analyze. For example, the system can elect to analyze all calls received within a one-minute period, five-minute period, 15-minute period, an hour period, etc. Using time stamps associated with the voice fingerprints, the system identifies all calls that fall within the selected short time period. Once the calls are identified, the system determines the number of times that each voice fingerprint is detected during the selected period. The operation of block 415 is used to identify when a material number of calls include the same speaker during the selected period of time. For example, the operation at block 415 can detect if the same speaker is present in dozens, hundreds, or thousands of calls per minute, per hour, etc. By reviewing calls within a selected time period, the system can detect if there are multiple occurrences of the same voice fingerprint at or near the same time. For example, the detection of the same voice fingerprint at the same time on multiple phone calls is indicative that the speaking party is likely a robocaller or other simulated caller.
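The counting step at block 415 can be sketched in Python as shown below. The sketch assumes that each stored fingerprint has already been resolved to a hashable speaker key (for example, by matching similar vectors to one another); that resolution step, the function name, and the record layout are hypothetical and are not described by this example.

```python
from collections import Counter
from datetime import datetime, timedelta
from typing import Iterable, Tuple


def count_in_window(records: Iterable[Tuple[str, datetime]],
                    window_start: datetime,
                    window_length: timedelta) -> Counter:
    """Count calls per speaker key whose start time falls in the window.

    Each record is (speaker_key, call_start_time). The window is half-open:
    window_start <= call_start_time < window_start + window_length.
    """
    window_end = window_start + window_length
    return Counter(key for key, started in records
                   if window_start <= started < window_end)


t0 = datetime(2020, 3, 16, 12, 0, 0)
records = [
    ("fp1", t0),
    ("fp1", t0 + timedelta(seconds=10)),
    ("fp2", t0 + timedelta(seconds=20)),
    ("fp1", t0 + timedelta(minutes=2)),  # outside a one-minute window
]
counts = count_in_window(records, t0, timedelta(minutes=1))
```

The same function can serve a longer window by passing a larger `window_length`, such as `timedelta(days=1)`.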

At a block 420, the system determines whether the number of times each voice fingerprint appears in a selected short time period, as determined at block 415, meets or exceeds a first threshold. In some embodiments, the first threshold represents a maximum number of phone calls a caller might legitimately place within the selected period of time. Example first thresholds include two calls, three calls, five calls, ten calls, etc. that occur within a few seconds or minutes. If the system determines that the number meets or exceeds the first threshold, the system designates the corresponding calls as likely spam and the caller identified by the voice fingerprint as a likely robocaller (e.g., that a batch tool or auto-dialer was used to generate robocalls). If the number meets or exceeds the threshold at decision block 420, processing continues to block 435 where the system takes corrective action. Otherwise, the processing continues to block 425. The first threshold associated with the short time period represents a number of calls beyond which it is not possible or likely that the calls are placed by a single person. The short time period and first threshold can be adjusted by the system according to an empirically determined threshold. As one example, a threshold of 5, 10, 20, or 30 calls may be associated with a short time period of one minute. If a number of calls associated with the same voice fingerprint meets or exceeds this threshold for the short time period, then the system determines that the caller associated with the voice fingerprint is a robocaller (e.g., because the calls are generated from a recording, using a computer, or using simulated speech, etc.).

At a block 425, the system selects a long time period and corresponding set of received calls to analyze. For example, the system may elect to analyze all calls received within an hour period, a 24-hour period, a week, etc. Using time stamps associated with the voice fingerprints, the system identifies all calls that fall within the selected long time period. Once the calls are identified, the system determines the number of times that each voice fingerprint is detected during the selected period. The operation of block 425 is used to identify when a material number of calls include the same speaker over a longer time frame. For example, the operation at block 425 can detect if the same speaker is present in hundreds or thousands of calls per day or week.

At a block 430, the process 400 determines if the number of times each voice fingerprint appears in a selected long time period, as determined at block 425, meets or exceeds a second threshold. The second threshold represents a maximum number of phone calls a human caller might legitimately place within the longer period of time. Example second thresholds include 25 calls, 100 calls, 250 calls, etc. within several hours, days, weeks, etc. For example, the longer time period of 5 days may be associated with a second threshold of 1000 calls, indicating that a number of calls beyond this threshold are likely robocalls (e.g., generated from recordings, computers, using simulated speech, etc.). The system can adjust the second threshold depending on the characteristics of the observed traffic. If the system determines that the number meets or exceeds the second threshold, the system designates the corresponding calls as likely spam and the caller identified by the voice fingerprint as a likely robocaller (e.g., that a batch tool or auto-dialer was used to generate robocalls). If the number meets or exceeds the second threshold at decision block 430, processing continues to block 435 where the system takes corrective action. Otherwise, the processing continues to block 440.
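The flow through decision blocks 420 and 430 can be summarized with the following Python sketch. The function name, the return labels, and the default thresholds (10 calls for the short period, 1000 calls for the long period) are illustrative assumptions drawn from the example values above. The sketch uses a "meets or exceeds" comparison; an implementation that requires the count to strictly exceed the threshold would use `>` instead.

```python
def classify_by_frequency(short_period_count: int,
                          long_period_count: int,
                          first_threshold: int = 10,
                          second_threshold: int = 1000) -> str:
    """Decide the next block for one voice fingerprint.

    Returns "corrective_action" (block 435) when either count meets or
    exceeds its threshold, otherwise "allowlist" (block 440).
    """
    # Block 420: too many calls within the short time period.
    if short_period_count >= first_threshold:
        return "corrective_action"
    # Block 430: too many calls within the long time period.
    if long_period_count >= second_threshold:
        return "corrective_action"
    return "allowlist"
```

Checking the short period first reflects the illustrated order, although the two checks are independent and could be evaluated in either order.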

At a block 435, the system takes corrective action. In some embodiments, the system takes corrective action by adding the voice fingerprints, corresponding phone numbers, or other identifiers from calls that met or exceeded the first or second threshold to the denylist. That is, the system disconnects current phone calls (when a likely robocaller is detected during the call) or blocks future phone calls associated with voice fingerprints, phone numbers, or other identifiers from calls that met or exceeded the first or second threshold. As described herein, corrective action can include generating a warning or indication to a user, providing a user the opportunity to terminate a call, automatically terminating or blocking a call, and so on. In some embodiments, the system may forgo corrective action, e.g., if a speaker is detected as being associated with a robocaller with a legitimate purpose (e.g., appointment reminders, and so forth).

If corrective action is not taken for a particular voice fingerprint, at a block 440 the system can add the voice fingerprints, corresponding phone numbers, or other identifiers associated with calls that did not meet or exceed the first and second thresholds to the allowlist. That is, the system will allow current or future phone calls associated with voice fingerprints, phone numbers, or other identifiers for which call quantities do not exceed the first and second thresholds to proceed in an unobstructed fashion.

Although the operations of the process 400 are discussed and illustrated in a particular order, the process 400 is not so limited. In some embodiments, the process 400 may perform operations in a different order. For example, the process 400 may perform blocks 425 and/or 430 before, during, and/or after performing blocks 415 and/or 420. Furthermore, a person skilled in the art will readily recognize that the process 400 can be altered and still remain within these and other embodiments of the system. For example, one or more operations (e.g., blocks 415 and 420, and/or blocks 425 and 430) illustrated in FIG. 4 can be omitted from the process 400.

Unless the context clearly requires otherwise, throughout the description and the claims, the words “comprise,” “comprising,” and the like are to be construed in an inclusive sense, as opposed to an exclusive or exhaustive sense; that is to say, in the sense of “including, but not limited to.” As used herein, the terms “connected,” “coupled,” or any variant thereof mean any connection or coupling, either direct or indirect, between two or more elements; the coupling or connection between the elements can be physical, logical, or a combination thereof. Where the context permits, words in the Detailed Description using the singular or plural number may also include the plural or singular number respectively.

From the foregoing, it will be appreciated that specific embodiments of the invention have been described herein for purposes of illustration, but that various modifications may be made without deviating from the scope of the invention. Accordingly, the invention is not limited except as by the appended claims.

1-20. (canceled)
 21. A method performed by a computing system toidentify a known caller in a received call using voice biometrics, themethod comprising: receiving call audio for a call, the call audiocontaining real or simulated human speech of a speaker in the callaudio; generating, using a voice biometrics detection model, a biometricvoice fingerprint for the speaker in the call audio, wherein thegenerated biometric voice fingerprint is based on multiple biometricindicators extracted from the call audio and is stored as a dimensionalvector; comparing the generated biometric voice fingerprint to at leastsome biometric voice fingerprints stored as dimensional vectors in a setof biometric voice fingerprints associated with known callers;calculating a probability that the speaker in the call audio is a knowncaller based on a comparison between the generated biometric voicefingerprint and a biometric voice fingerprint in the set of biometricvoice fingerprints; and causing performance of an action depending onthe calculated probability that the speaker in the call audio is theknown caller, wherein the action includes allowing the call to proceed,generating an audio or visual warning associated with the call,generating a confirmation request to confirm an identity of the speaker,or terminating the call.
 22. The method of claim 21, wherein themultiple biometric indicators extracted from the call audio include atleast one of volume, speaking rate, pitch, length of pauses, or durationof pauses.
 23. The method of claim 21, wherein the voice biometricsdetection model is generated based on one or more artificialintelligence (AI) speech data processing models.
 24. The method of claim21: wherein at least some of the known callers in the set of biometricvoice fingerprints are each associated with a caller type, the callertype including a robocaller, a spam caller, or a legitimate caller, andwherein calculating a probability that the speaker in the call audio isthe known caller includes calculating a probability of a caller type forthe speaker.
 25. The method of claim 21, wherein calculating aprobability that the speaker in the call audio is a known callerincludes calculating a similarity between the generated biometric voicefingerprint and the biometric voice fingerprint in the set of biometricvoice fingerprints.
 26. The method of claim 25, wherein calculating asimilarity comprises calculating a distance between the generatedbiometric voice fingerprint dimensional vector and the biometric voicefingerprint dimensional vectors in the set of biometric voicefingerprints.
 27. The method of claim 21, wherein the set of biometricvoice fingerprints associated with the known callers includes at leastone biometric voice fingerprint determined to be associated with arobocaller based on a frequency of occurrence of the at least onebiometric voice fingerprint in a dataset comprising multiple voicefingerprints for callers detected in calls placed via a network duringan analyzed timeframe.
 28. The method of claim 21, wherein the call audio includes a caller channel and a called channel, and wherein the multiple biometric indicators are extracted from the caller channel.
 29. The method of claim 21, wherein the audio or visual warning is a notification of the identification of the speaker.
 30. The method of claim 21, wherein the confirmation request is delivered via a graphical user interface (GUI), a text message, or an email.
 31. A non-transitory computer-readable medium carrying instructions that, when executed by a computing system, cause the computing system to perform operations to identify a known caller in a received call using voice biometrics, the operations comprising: receiving call audio for a call, the call audio containing real or simulated human speech of a speaker in the call audio; generating, using a voice biometrics detection model, a biometric voice fingerprint for the speaker in the call audio, wherein the generated biometric voice fingerprint is based on multiple biometric indicators extracted from the call audio and is stored as a dimensional vector; comparing the generated biometric voice fingerprint to at least some biometric voice fingerprints stored as dimensional vectors in a set of biometric voice fingerprints associated with known callers; calculating a probability that the speaker in the call audio is a known caller based on a comparison between the generated biometric voice fingerprint and a biometric voice fingerprint in the set of biometric voice fingerprints; and causing performance of an action depending on the calculated probability that the speaker in the call audio is the known caller.
 32. The non-transitory computer-readable medium of claim 31, wherein the action includes allowing the call to proceed.
 33. The non-transitory computer-readable medium of claim 31, wherein the action includes terminating the call, generating a confirmation request to confirm an identity of the speaker, or both.
 34. The non-transitory computer-readable medium of claim 33, wherein the confirmation request is delivered via a graphical user interface (GUI), a text message, or an email.
 35. The non-transitory computer-readable medium of claim 31, wherein the action includes generating an audio or visual warning associated with the call.
 36. The non-transitory computer-readable medium of claim 31, wherein the multiple biometric indicators extracted from the call audio include at least one of volume, speaking rate, pitch, length of pauses, or duration of pauses.
 37. The non-transitory computer-readable medium of claim 31, wherein the voice biometrics detection model is generated based on one or more artificial intelligence (AI) speech data processing models.
 38. The non-transitory computer-readable medium of claim 31, wherein calculating a probability that the speaker in the call audio is a known caller includes calculating a similarity between the generated biometric voice fingerprint and the biometric voice fingerprint in the set of biometric voice fingerprints.
 39. The non-transitory computer-readable medium of claim 31, wherein the set of biometric voice fingerprints associated with the known callers includes at least one biometric voice fingerprint determined to be associated with a robocaller based on a frequency of occurrence of the at least one biometric voice fingerprint in a dataset comprising multiple voice fingerprints for callers detected in calls placed via a network during an analyzed timeframe.
 40. The non-transitory computer-readable medium of claim 31, wherein the call audio includes a caller channel and a called channel, and wherein the multiple biometric indicators are extracted from the caller channel.
 41. The non-transitory computer-readable medium of claim 31: wherein at least some of the known callers in the set of biometric voice fingerprints are each associated with a caller type, the caller type including a robocaller, a spam caller, or a legitimate caller, and wherein calculating a probability that the speaker in the call audio is the known caller includes calculating a probability of a caller type for the speaker.
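
By way of non-limiting illustration only, the comparing, calculating, and action steps recited in claims 21 and 31, together with the frequency-based robocaller determination of claims 27 and 39, might be sketched in Python as follows. The claims do not specify a distance metric, a probability function, or any threshold values, so everything below is an assumption made for the example: cosine distance, a logistic mapping from distance to probability, the cutoffs 0.9, 0.6, and 0.4, and the hypothetical names `cosine_distance`, `match_probability`, `best_match`, `choose_action`, and `flag_robocallers`. The fingerprint vectors are taken as already produced by some voice biometrics detection model, which is not shown.

```python
import math
from collections import Counter


def cosine_distance(a, b):
    """Distance between two fingerprint vectors (0 = same direction)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 1.0  # treat an empty fingerprint as unrelated
    return 1.0 - dot / (norm_a * norm_b)


def match_probability(distance, scale=10.0, midpoint=0.3):
    """Assumed logistic mapping: small distance -> probability near 1."""
    return 1.0 / (1.0 + math.exp(scale * (distance - midpoint)))


def best_match(fingerprint, known):
    """Return (caller_id, caller_type, probability) of the closest known caller.

    `known` maps a caller id to (vector, caller_type).
    """
    best = (None, None, -1.0)
    for caller_id, (vector, caller_type) in known.items():
        prob = match_probability(cosine_distance(fingerprint, vector))
        if prob > best[2]:
            best = (caller_id, caller_type, prob)
    return best


def choose_action(prob, caller_type):
    """Pick one of the claimed actions; all cutoffs are illustrative."""
    if prob >= 0.9 and caller_type in ("robocaller", "spam"):
        return "terminate"
    if prob >= 0.6 and caller_type in ("robocaller", "spam"):
        return "warn"
    if 0.4 <= prob < 0.9:
        return "confirm"  # ask the called party to confirm the identity
    return "allow"


def flag_robocallers(observations, start, end, threshold):
    """Flag fingerprint ids seen more than `threshold` times in [start, end].

    `observations` is an iterable of (fingerprint_id, call_time) pairs.
    """
    counts = Counter(fid for fid, t in observations if start <= t <= end)
    return {fid for fid, n in counts.items() if n > threshold}


if __name__ == "__main__":
    known = {
        "robo-1": ([1.0, 0.0, 0.0], "robocaller"),
        "alice": ([0.0, 1.0, 0.0], "legitimate"),
    }
    caller_id, caller_type, prob = best_match([0.99, 0.05, 0.0], known)
    print(caller_id, caller_type, choose_action(prob, caller_type))
```

In this sketch a legitimate known caller is never terminated, and a probability in the middle band triggers a confirmation request. That is one possible policy among many that the claim language would permit, not a statement of how the disclosed system selects an action.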